Branch Predictor for Wide Issue, Arbitrarily Aligned Fetch

ABSTRACT

In an embodiment, a processor may be configured to fetch N instruction bytes from an instruction cache (a “fetch group”), even if the fetch group crosses a cache line boundary. A branch predictor may be configured to produce branch predictions for up to M branches in the fetch group, where M is a maximum number of branches that may be included in the fetch group. In an embodiment, a branch direction predictor may be updated responsive to a misprediction and also responsive to the branch prediction being within a threshold of transitioning between predictions. To avoid a lookup to determine if the threshold update is to be performed, the branch predictor may detect the threshold update during prediction, and may transmit an indication with the branch.

The present application is a divisional of U.S. application Ser. No. 13/625,382, filed on Sep. 24, 2012, which is hereby incorporated by reference in its entirety.

BACKGROUND

1. Technical Field

Embodiments described herein are related to the field of processors and, more particularly, to branch prediction in processors.

2. Description of the Related Art

One of the key factors affecting the performance of processors is the management of branch instructions (or more briefly, “branches”). A variety of branch predictors can be used to predict the direction (taken or not taken), the target address, etc. for branches, to allow the processor to fetch ahead of the branches. If the predictions are correct, the next instructions to be executed after each branch may already be preloaded into the processor's pipeline, which may enhance performance over fetching the instructions after executing each branch. Similarly, the next instructions can be speculatively executed and thus can be ready to retire/commit results when the branch is resolved (if the prediction is correct), further enhancing performance.

Branch predictors can be accessed in different fashions, depending on how early in the pipeline the branch predictors are accessed. Generally, the earlier in the pipeline that the predictor is accessed, the less information about the branch is available. For example, if the branch predictor is accessed in parallel with cache access for a fetch, the branch predictor can produce a prediction based on the fetch address. However, the location of the branch instruction is unknown and thus the branch must be located after fetch and the prediction associated with the branch. If the prediction is not taken, there may be another branch in the instructions fetched which could have been predicted but was not predicted.

SUMMARY

In an embodiment, a processor may be configured to fetch N instruction bytes from an instruction cache, even if the N instruction bytes cross a cache line boundary. A branch predictor in the processor may be configured to produce branch predictions for up to M branches in the N instruction bytes, where M is a maximum number of branches that may be included in the N instruction bytes for a first instruction set implemented by the processor. In some embodiments, the processor may also implement a second instruction set that may include more than M branches in the N instruction bytes, but the occurrence of more than M branches may be rare. In an embodiment, branch prediction accuracy may be increased by providing predictions for each branch in the N instruction bytes for most cases. The increased branch prediction accuracy may lead to increased performance.

In an embodiment, a branch direction predictor may be updated responsive to a misprediction and also responsive to the branch prediction being within a threshold of transitioning between predictions. To avoid a lookup to determine if the threshold update is to be performed, the branch predictor may detect the threshold update during prediction, and may transmit an indication with the branch. When the branch is executed, the misprediction update may be determined by the branch execution unit. The branch execution unit may transmit an update request responsive to either a branch misprediction or the indication with the branch indicating a threshold update.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description makes reference to the accompanying drawings, which are now briefly described.

FIG. 1 is a block diagram of one embodiment of a processor.

FIG. 2 is a block diagram of one embodiment of a branch direction predictor.

FIG. 3 is a block diagram of one embodiment of a branch predictor table shown in FIG. 2.

FIG. 4 is a flowchart illustrating operation of one embodiment of the branch direction predictor shown in FIG. 2 during prediction of a branch.

FIG. 5 is a flowchart illustrating operation of one embodiment of a branch execution unit shown in FIG. 1 during execution of a branch.

FIG. 6 is a flowchart illustrating operation of one embodiment of the branch direction predictor shown in FIG. 2 in response to a branch update request.

FIG. 7 is a block diagram of one embodiment of a system.

While the embodiments described herein are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.

Various units, circuits, or other components may be described as “configured to” perform a task or tasks. In such contexts, “configured to” is a broad recitation of structure generally meaning “having circuitry that” performs the task or tasks during operation. As such, the unit/circuit/component can be configured to perform the task even when the unit/circuit/component is not currently on. In general, the circuitry that forms the structure corresponding to “configured to” may include hardware circuits. Similarly, various units/circuits/components may be described as performing a task or tasks, for convenience in the description. Such descriptions should be interpreted as including the phrase “configured to.” Reciting a unit/circuit/component that is configured to perform one or more tasks is expressly intended not to invoke 35 U.S.C. §112, paragraph six interpretation for that unit/circuit/component.

DETAILED DESCRIPTION OF EMBODIMENTS

Turning now to FIG. 1, a block diagram of one embodiment of a processor 10 is shown. In the illustrated embodiment, the processor 10 includes a fetch control unit 12, an instruction cache 14, and an execution core 16. The execution core 16 may include a branch execution unit 18, and the fetch control unit 12 may include a branch direction predictor 20. The fetch control unit 12 may be configured to output a fetch group of instructions (in the event of a cache hit) to the execution core 16. The branch direction predictor 20 may be configured to generate weak prediction update indications for potential branches in the fetch group. The branch execution unit 18 may be configured to execute branches and to generate updates for the branch direction predictor 20. Additionally, the branch execution unit 18 may be configured to signal mispredictions (not shown) to the fetch control unit 12 to cause the fetch control unit 12 to begin fetching at the correct fetch address based on the branch execution.

The fetch control unit 12 may be configured to generate fetch addresses by speculating on the instruction stream. Branch prediction is a factor in such speculation. Generally, branch prediction may refer to any mechanism for speculating on the result of one or more aspects of branch instruction execution. For example, the branch direction predictor 20 may predict the direction of a conditional branch (e.g. taken or not taken). A taken branch causes the next instruction in the instruction stream to be at the target address of a branch (although in some instruction sets there may be a delay slot instruction that is also executed sequential to the branch before fetching the target instruction). A not-taken branch causes the next instruction to be fetched at the sequential address of the branch (i.e. the fetch address of the branch plus the size of the branch, and again there may be a delay slot instruction). A conditional branch may be determined to be taken or not taken by evaluating one or more conditions specified by the conditional branch. The conditions may be based on a previous instruction's execution and/or a comparison specified by the branch instruction, in various instruction sets. The target address of the branch may be specified by other operands of the instruction. Some branch instructions may be indirect, where at least one of the operands specifying the target address is either a memory location or a register value. Other branches may specify the target address directly as a displacement in the instruction (added to the fetch address) or an immediate field specifying an absolute address. Indirect branches may have branch target prediction, where the target address is predicted.

The fetch control unit 12 may, in some embodiments, implement other speculative structures for fetch address generation. For example, a fetch address predictor may be trained on a speculative instruction stream, and lookups in the predictor may be used for each fetch address to predict the next fetch address. Various branch predictors, such as the branch direction predictor 20, may be used to validate the fetch address predictor and to train the fetch address predictor. Subsequent branch execution may validate the branch predictors, and may be used to train the branch predictors.

The instruction cache 14 may be configured to fetch instructions from the fetch address, and provide the instructions to the execution core 16. In one embodiment, the instruction cache 14 may be configured to provide a fetch group of instructions. The fetch group may be defined as a group of instructions beginning at the fetch address. The number of instructions or the number of bytes in a fetch group may be fixed, even if the fetch group crosses a cache line boundary in the instruction cache 14. That is, the instruction cache 14 may be configured to output instructions from both the cache line addressed by the fetch address and the next consecutive cache line, assuming both cache lines are a hit in the cache. The next consecutive cache line may be the cache line that would abut the current cache line in main memory (i.e. the next numerically higher fetch address on a cache line granularity). The instruction cache 14 may be banked, and the portions of the current cache line and the next consecutive cache line that may form a fetch group are stored in different banks. Index generation may also be altered such that the next consecutive index is generated when the fetch group may extend across a cache line boundary (or the next consecutive index may be unconditionally generated, but may only be used in the case that the fetch group extends across the cache line boundary). While the fetch group is fixed in size, the actual number of instructions used from a fetch group may vary. For example, a predicted-taken conditional branch or unconditional branch may cause the subsequent instructions from the fetch group to be discarded.
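The cache line crossing described above can be illustrated with a minimal sketch. The 64-byte line size, the 32-byte fetch group size, and the function name are assumptions chosen only for illustration; the description does not fix these values.

```python
LINE_BYTES = 64    # assumed cache line size in bytes
FETCH_BYTES = 32   # assumed fixed fetch group size in bytes (N)


def fetch_group_lines(fetch_addr):
    """Return (current_line, next_line_or_None, offset) for a fetch group.

    next_line is the next consecutive cache line, reported only when the
    fixed-size fetch group extends past the end of the current line.
    """
    line = fetch_addr // LINE_BYTES
    offset = fetch_addr % LINE_BYTES
    crosses = offset + FETCH_BYTES > LINE_BYTES
    return line, (line + 1 if crosses else None), offset


# A fetch starting 48 bytes into line 4 needs 16 bytes from line 5.
print(fetch_group_lines(0x130))  # (4, 5, 48)
```

A fetch group that ends exactly at the line boundary (offset 32 in this sketch) does not require the next line.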

In an embodiment, the branch direction predictor 20 may implement a Perceptron-based prediction scheme. In a Perceptron-based predictor, multiple branch predictor tables may be indexed for a given fetch address, and each table may output a branch predictor value. The branch predictor value may be a signed weight, for example. The branch predictor value output from each table may be summed together to produce a summed branch predictor value. The sign of the sum may be used to predict the direction. In one particular embodiment, the branch direction predictor 20 may be based on an Optimized Geometric History Length (O-GEHL) predictor. Additional details of one embodiment are described further below.

The O-GEHL predictor may be trained in response to mispredictions, and may also be trained when the summed branch predictor value is near zero (i.e. within a threshold of zero). For example, the summed branch predictor value may be near zero when the absolute value of the summed branch predictor value is less than a specified threshold. The threshold may be fixed or programmable, in various embodiments. When the summed branch predictor value is near zero, the prediction may be susceptible to changing the direction of the prediction. That is, a small update of the predictor values in the other direction may change the sign of the sum and thus the prediction. Accordingly, training on correct prediction when the summed branch predictor value is near zero may strengthen the prediction and help prevent the change in direction of the prediction when the current direction prediction is correct.

In the illustrated embodiment, the branch direction predictor 20 may be configured to transmit a weak prediction update indication for each branch prediction. The weak prediction update indication may indicate that the summed branch predictor value is within a threshold of zero. If the branch prediction is correct, the branch direction predictor 20 may still be trained to strengthen the prediction. The branch direction predictor 20 may be configured to make the determination that the prediction is weak (and should be trained on correct prediction) at the time the prediction is made. The weak prediction update indications may be transmitted with the branch instructions, and may cause the branch execution unit 18 to request an update even if the prediction is correct. By identifying the weak predictions at the time the prediction is made, and transmitting the weak prediction update indications with the branch instructions, a branch direction predictor read may be avoided at branch prediction verification/training time if the prediction is correct and the prediction is not weak. The competition for access to the branch direction predictor 20 may be reduced. In some embodiments, the branch direction predictor 20 may be implemented with a single port to the branch prediction tables, which may be shared between the training reads and the prediction reads. The inventors have discovered that identifying the weak predictions at prediction time rather than verification/training time does not significantly impact the accuracy of the predictor, and reduces the number of reads of the predictor.

In one embodiment, the branch direction predictor 20 may be configured to provide a branch prediction for each potential branch instruction in the fetch group (and may also provide weak prediction update indications for each potential branch instruction in the fetch group). Accordingly, each entry in the branch direction predictor 20 may store a number of branch predictions equal to a number of instructions that may reside in the cache line. The branch direction predictor may be banked in at least two banks. Based on the offset of the fetch address, one or more banks at the index generated from the fetch address and zero or more banks at the next index may be read to select the branch prediction values for the fetch group. The branch predictions that actually correspond to branch instructions in the fetch group may be identified later in the pipeline by offset from the beginning of the fetch group. In an embodiment, the processor 10 may be configured to execute two instruction sets. The first instruction set may have fixed length instructions (e.g. 32 bits, or 4 bytes). Each entry of the branch direction predictor may store a branch prediction for each potential instruction in the cache line according to the first instruction set. The second instruction set may have both 16 bit (2 byte) and 32 bit (4 byte) instructions. Thus, there are more potential instructions in the same sized cache line for the second instruction set. However, it is infrequent that branch instructions are adjacent in the instruction stream. Accordingly, the same set of branch predictors may be used for the second instruction set as are used for the first instruction set. Alternatively, the set of branch predictors stored in a given entry may include sufficient predictors to predict each potential branch instruction in the second instruction set as well.

The instruction cache 14 may have any construction, configuration, and size. For example, the instruction cache 14 may be set associative, direct mapped, or fully associative. Cache lines may be of any size as well (e.g. 32 bytes, 64 bytes, etc.).

The execution core 16 may be coupled to receive the instructions from the instruction cache 14 (and weak prediction update indications from the branch direction predictor 20) and may be configured to execute the instructions. The execution core 16 may include any execution hardware, including circuitry to decode instructions into one or, in some embodiments, multiple ops to be executed. The execution hardware may further include circuitry to perform register renaming (for embodiments that implement register renaming). The execution hardware may further include circuitry to schedule, execute, and retire instructions. The execution core 16 may be pipelined and/or superscalar. The schedule circuit may implement centralized scheduling (e.g. a scheduler that issues ops and retires ops) or distributed scheduling (e.g. reservation stations for issuing ops and a reorder buffer for retiring ops). The execution hardware may include execution units of various types, including the branch execution unit 18. There may be one or more of each type of execution unit in various embodiments. The execution hardware may include speculative and/or out-of-order execution mechanisms, or in-order execution mechanisms, in various embodiments. The execution hardware may include microcoding, in some embodiments. Any configuration of the execution core 16 may be implemented.

As mentioned previously, the branch execution unit 18 may be configured to execute branch instructions. Branch instruction execution may include evaluating a condition or conditions specified by a conditional branch, and determining a taken/not-taken result based on the evaluation. If the target address is generated using more than one value (e.g. a displacement and fetch address), the branch execution unit 18 may be configured to generate the target address. Any branch predictions may be validated. If there is a misprediction, the branch execution unit 18 may be configured to signal the fetch control unit 12 to begin fetching at the correct address. There may also be signalling within the processor's pipelines to purge instructions that are subsequent to the mispredicted branch instruction (and thus are not part of the correct instruction stream based on the branch execution).

Additionally, if the branch direction is mispredicted or the weak prediction update indication associated with the branch indicates update, the branch execution unit 18 may be configured to transmit an update request to the branch direction predictor 20. The update request may include the fetch address of the branch instruction, and may also include corresponding branch history used to generate the indexes to the branch predictor tables. Alternatively, the branch history may be stored locally by the branch direction predictor 20, or the indexes may be carried with the branches and provided as part of the update request. The update request may further indicate the taken/not taken result and an indication of whether the branch was mispredicted or weakly predicted.
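As a rough software model of this decision, the sketch below forms an update request only on a misprediction or when the weak prediction update indication travels with the branch. The `UpdateRequest` fields and the function name are hypothetical; the branch history or indexes that may accompany a request are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpdateRequest:
    """Hypothetical update request sent to the branch direction predictor."""
    fetch_addr: int
    taken: bool          # resolved taken/not taken result
    mispredicted: bool   # False means the update is due to a weak prediction


def resolve_branch(fetch_addr: int, predicted_taken: bool,
                   weak_update: bool, actual_taken: bool) -> Optional[UpdateRequest]:
    """Return an update request, or None when no predictor access is needed."""
    mispredicted = predicted_taken != actual_taken
    if mispredicted or weak_update:
        return UpdateRequest(fetch_addr, actual_taken, mispredicted)
    # Correct and strongly predicted: no predictor read or write is requested.
    return None
```

In this model a correctly predicted branch without the weak indication produces no request, which corresponds to the predictor read that the description says may be avoided.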

It is noted that, while the branch execution unit 18 is shown as part of the execution core 16 in the embodiment of FIG. 1, other embodiments may implement the branch execution unit further up the instruction execution pipeline, if desired. For example, the branch execution unit 18 may be part of the fetch control unit 12, and may receive input conditions from the execution core 16 for evaluation of conditional branches.

Turning next to FIG. 2, a block diagram of one embodiment of the branch direction predictor 20 is shown. In the embodiment of FIG. 2, the branch direction predictor 20 includes a set of predictor tables 30A-30N, an index generator 32, an increment/decrement unit 34, a set of P+1 adders 36, and a mux 38. The mux 38 is coupled to receive a fetch address generated by the fetch control unit 12 and an update address from the branch execution unit 18. The mux 38 may be coupled to receive a selection control from the index generator 32, which may be coupled to receive the selected address from the mux 38. The index generator 32 may be coupled to receive a fetch address valid (FA valid) signal indicating whether or not a valid fetch address is input to the mux 38, and an update valid signal from the branch execution unit 18 indicating that the update address is valid. The index generator 32 includes history storage 40 storing various branch history used to generate indexes for the predictor tables 30A-30N. The index generator 32 is coupled to provide indexes to read ports (R) on the predictor tables 30A-30N. The predictor tables 30A-30N are configured to output branch prediction values (BP1 to BPN) to the adders 36 and to the increment/decrement unit 34. The increment/decrement unit 34 is coupled to receive an update taken/not taken result from the branch execution unit 18. The increment/decrement unit 34 is configured to provide the updated branch prediction values to the write ports on the predictor tables 30A-30N. Both M and P in FIG. 2 may be integers greater than zero.

If the fetch address is valid, the index generator 32 may be configured to select the fetch address for index generation. If the fetch address is not valid and the update address is valid, the index generator 32 may be configured to select the update address for index generation. If both addresses are valid at the same time, there may be a pipeline stage to capture the update address (and corresponding taken/not taken result) for update in a subsequent clock cycle.
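The selection priority can be summarized in a few lines. This is only a behavioral sketch; the function name and the string labels for the mux selection are illustrative and not part of the described hardware.

```python
def select_address(fa_valid, update_valid):
    """Return (selected_source, hold_update) for one clock cycle.

    selected_source is "fetch", "update", or None. hold_update is True
    when a valid update loses to a valid fetch and must be captured
    for a subsequent cycle.
    """
    if fa_valid:
        return "fetch", update_valid
    if update_valid:
        return "update", False
    return None, False
```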

The index generator 32 may be configured to generate a different index for each predictor table 30A-30N. In an embodiment implemented based on O-GEHL, the index for each table may include a different amount of branch history (global and path). Specifically, a geometrically increasing amount of global history and geometrically increasing (but capped) amount of path history may be used for each successive index generation. In an embodiment, the index for the predictor table 30A may be generated from only fetch address bits. The index for the predictor table 30B may be generated from fewer address bits, along with some global history and path history bits. The index for the next predictor table may be generated from still fewer address bits, along with still more global history and path history bits. The number of path history bits may be capped, and there may be a floor to the reduction in the number of address bits. Additional details for an embodiment of the index generation are provided further below. Generally, the global history and path history may be generated from the target addresses of taken branches. In one embodiment, the global history may be left shifted and XORed with the target address of the next taken branch. The path history may be left shifted by one bit and a selected least significant bit of the target address may be shifted in to the least significant bit of the path history. Alternatively, the path history may be left shifted by more than one bit and an equal number of bits may be shifted in from the target address of the next taken branch.
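The history updates on a taken branch might be modeled as follows. The history widths, the one-bit global history shift, and the choice of target address bit 2 as the "selected least significant bit" are all assumptions made for this sketch; the description leaves them open.

```python
GHIST_BITS = 64  # assumed global history width
PHIST_BITS = 16  # assumed path history width (the cap)


def update_histories(ghist, phist, target, path_bit=2):
    """Update (global, path) history for a taken branch with the given target.

    Global history: left shift (by one bit here, an assumption), then XOR
    with the target address. Path history: left shift by one bit and shift
    in one selected low-order bit of the target address.
    """
    ghist = ((ghist << 1) ^ target) & ((1 << GHIST_BITS) - 1)
    phist = ((phist << 1) | ((target >> path_bit) & 1)) & ((1 << PHIST_BITS) - 1)
    return ghist, phist


print(update_histories(0b1, 0b1, 0b100))  # (6, 3)
```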

The index generator 32 may provide each index to the respective read port of the corresponding predictor table 30A-30N. Each predictor table 30A-30N may output a branch predictor value to be used to generate a branch predictor. More particularly, each branch predictor value may be an M+1 bit signed weight. The weights may be added by the adders 36 and the sign of the resulting sum may indicate the direction prediction (e.g. taken if positive, not taken if negative, or vice versa).

In the illustrated embodiment, each predictor table may be configured to output P+1 branch predictor values, corresponding to P+1 potential branches in a fetch group. Each of the P+1 branch predictor values corresponds to the position of an instruction in the fetch group of the first instruction set. That is, there may be P+1 instructions in a fetch group. Accordingly, P+1 branch predictions (taken/not taken) may be output by the adders 36 by adding the branch predictor values corresponding to a given position in the fetch group. The branch predictions are output by the adders 36 (BP[0 . . . P] in FIG. 2) and may be associated with the P+1 instructions in the fetch group by position.

Additionally, as mentioned previously, the branch direction predictor 20 may be configured to detect which predictions are weak (e.g. within a threshold of zero) at prediction time. For example, the adders 36 may include circuitry to compare the resulting sums to the threshold value, to generate the corresponding weak prediction update indications. The indications may be asserted to indicate update (weak prediction) and deasserted to indicate no update.
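A small behavioral sketch of the adders may help: for each position in the fetch group, the weights read from every table are summed, the sign gives the direction, and the magnitude is compared against the threshold. Treating a non-negative sum as "taken" is one of the two sign conventions the description allows, chosen here as an assumption; the function name is illustrative.

```python
def predict_positions(table_outputs, threshold):
    """Return a (taken, weak) pair for each position in the fetch group.

    table_outputs holds one row per predictor table, each row containing
    the signed weights for positions 0..P of the fetch group.
    """
    sums = [sum(column) for column in zip(*table_outputs)]
    # Assumption: non-negative sum predicts taken; weak means |sum| < threshold.
    return [(s >= 0, abs(s) < threshold) for s in sums]


# Three tables, three positions: the per-position sums are 3, -9 and 1.
outputs = [[1, -3, 4],
           [2, -2, -5],
           [0, -4, 2]]
print(predict_positions(outputs, threshold=2))
# [(True, False), (False, False), (True, True)]
```

Only the last position would carry an asserted weak prediction update indication in this example.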

The adders 36 may be configured to generate predictions in response to a fetch address input. If an update address was presented to the index generator 32, the branch predictor values from the tables may be presented to the increment/decrement unit 34 for update. The increment/decrement unit 34 may be configured to either increment or decrement the branch predictor values corresponding to the branch that was mispredicted (or correctly predicted with a weak update indication asserted). Thus, the offset of the branch within the fetch group may be identified to identify which branch predictor values to update.

In an embodiment in which the positive sign of the sum indicates a taken prediction, the branch predictor values may be incremented in response to a taken branch and decremented in response to a not-taken branch. Other embodiments may define a positive sign to indicate not taken, in which case the branch predictor values may be incremented in response to a not-taken branch and decremented in response to a taken branch.
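For the convention in which a positive sum indicates taken, the update of a single weight could look like the sketch below. Saturating at the limits of the signed range is an assumption (it is common for counter-based predictors, but the description does not state how overflow is handled), as is the 4-bit default width.

```python
def update_weight(weight, taken, bits=4):
    """Increment on taken, decrement on not taken, for a signed weight.

    bits is the total weight width (M+1 in the description); the value
    is clamped to the two's complement range, which is an assumption.
    """
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return min(hi, weight + 1) if taken else max(lo, weight - 1)
```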

The modified branch predictor values (and unmodified values from the same entry) may be returned to the write port on the predictor tables 30A-30N for update into the memory. The index generator 32 may be configured to supply the same indexes used for the read to the write port to update the correct entries.

It is noted that, while the embodiment described above implements both the weak prediction update indicators transmitted with the predicted branch instructions and the generation of predictions for each potential branch instruction in a fetch group, other embodiments may implement one or the other feature. For example, embodiments which implement only predicting each potential branch, but which read the predictor at branch training time to determine if the prediction is weak for a correctly predicted branch may be implemented. Similarly, embodiments which only predict one branch direction but which make the weak prediction update determination at the time of prediction may be implemented.

FIG. 3 is a block diagram illustrating one embodiment of a predictor table 30A. Other predictor tables 30B-30N may be similar. The embodiment of FIG. 3 illustrates the banking of the predictor table 30A into at least two banks: an upper bank 50 and a lower bank 52. Each entry in the predictor table 30A may include a portion in the upper bank 50 and a portion in the lower bank 52. Instruction execution order may begin in the lower bank 52 and progress upward to the upper bank 50. Thus, the predictions of the entry (corresponding to a cache line of instructions) may be evenly divided in the embodiment of FIG. 3. That is, half of the predictions for a cache line may be in the lower bank 52 and the other half of the predictions may be in the upper bank 50.

The index generator 32 may be configured to generate separate indexes for each bank: index L for the lower bank 52 and index U for the upper bank 50. If the offset of the fetch address identifies a byte in the lower half of the cache line, corresponding to a prediction in the lower bank 52, then both indexes may be equal and the two halves of the same entry may be read/written. On the other hand, if the offset of the fetch address identifies a byte in the upper half of the cache line, corresponding to a prediction in the upper bank 50, then the index U for the upper bank 50 may be generated as described above. The index L for the lower bank 52 may be generated as the upper bank index plus one, selecting the next consecutive entry in the predictor table 30A (corresponding to the next consecutive cache line to the cache line being fetched).
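The two-bank index selection reduces to a comparison on the cache line offset. In the sketch below, the 64-byte line, the 1024-entry table, and the wraparound of the incremented index are assumptions for illustration.

```python
def bank_indexes(index, offset, line_bytes=64, num_entries=1024):
    """Return (index_u, index_l) for the upper and lower banks.

    index is the table index generated for the fetch address and offset
    is the byte offset of the fetch address within its cache line.
    """
    if offset < line_bytes // 2:
        # Fetch starts in the lower half: both banks read the same entry.
        return index, index
    # Fetch starts in the upper half: the lower bank reads the next
    # consecutive entry (wraparound at the table size is an assumption).
    return index, (index + 1) % num_entries
```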

It is noted that, while an upper bank 50 and lower bank 52 are illustrated in FIG. 3, other embodiments may implement more than two banks. If a fetch group covers less than half a cache line, for example, power may be saved by using more than two banks because there would be cases in which at least one bank would be idle for a read or write operation.

Index generation for one embodiment is now described in more detail. According to the O-GEHL algorithm, one selects the minimum (floor) address bits, minimum and maximum global history bits, and the cap on path history bits to use for the predictor. The number of global history bits to use for index generation for a given predictor table i may be determined from the equations given below (if the minimum global history bits, used for the predictor table 30B, is L(1) and the maximum global history bits is L(N−1) for N tables, wherein N is an integer greater than zero):

alpha=(L(N−1)/L(1))^(1/(N−2))

L(i)=alpha^(i−1) *L(1)

The number of path history bits is the lesser of L(i) and the selected cap. The path history bits, address bits, and global history bits may be concatenated, and the resulting value may be hashed to generate the index.
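As a worked example of the geometric series, the sketch below computes the global and path history lengths for tables 1 through N−1, reading the exponent of alpha as 1/(N−2) so that L(N−1) comes out equal to the chosen maximum. Rounding to the nearest integer and the specific parameter values are assumptions; the function name is illustrative.

```python
def history_lengths(l_min, l_max, n_tables, path_cap):
    """Return [(global_bits, path_bits)] for tables 1..N-1 (N must exceed 2).

    Table 0 is omitted because its index uses only address bits.
    """
    alpha = (l_max / l_min) ** (1.0 / (n_tables - 2))
    lengths = []
    for i in range(1, n_tables):
        global_bits = int(alpha ** (i - 1) * l_min + 0.5)  # rounding assumed
        lengths.append((global_bits, min(global_bits, path_cap)))
    return lengths


# 8 tables, 2 to 128 global history bits, path history capped at 16 bits.
print(history_lengths(2, 128, 8, 16))
# [(2, 2), (4, 4), (8, 8), (16, 16), (32, 16), (64, 16), (128, 16)]
```

With these values alpha is 2, so each successive table doubles its global history length while the path history stops growing at the cap.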

In order to keep the amount of logic generating the index small (and fast) in some embodiments, the number of bits used in the hash may be limited to a multiple of the index width in bits. That is, each index bit may be generated from a number of bits of the concatenated value, wherein the number of bits is equal to the multiple. For example, the multiple may be selected as 3, and the hash function may be a three input exclusive OR (XOR) of bits from the concatenated value to each bit of index value. As the number of global history bits grows (and the number of address bits is reduced to the minimum), the total number of bits in the concatenated value becomes larger than the multiple times the width of the index. In such cases, the concatenated value may be sampled at an interval defined by the ratio of the number of bits in the concatenated value and the multiple times the width. The selected bits may also be right-rotated to further modify the indexes generated for each predictor table.
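One possible reading of this hash is sketched below: the concatenated value is reduced to multiple × width bits by sampling at a fixed interval, and each index bit is the XOR of one bit from each of the resulting width-sized groups. How bits are assigned to XOR inputs, the zero padding of short inputs, and the omission of the right-rotation step are assumptions of this sketch rather than details taken from the description.

```python
def fold_index(bits, width, multiple=3):
    """Hash a list of 0/1 bits from the concatenated value into an index.

    At most multiple * width bits are used; longer inputs are sampled at
    an interval of len(bits) / (multiple * width).
    """
    need = multiple * width
    if len(bits) > need:
        step = len(bits) / need
        bits = [bits[int(k * step)] for k in range(need)]
    else:
        bits = bits + [0] * (need - len(bits))  # zero padding is an assumption
    index = 0
    for b in range(width):
        x = 0
        for m in range(multiple):
            x ^= bits[m * width + b]  # one input from each group of width bits
        index |= x << b
    return index


print(fold_index([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0], width=4))  # 14
```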

The address bits used in the index generation may exclude the bits that define the offset within a cache line. That is, since an entry in the predictor tables 30A-30N includes branch predictor values for a full cache line, the offset within the cache line identifies a beginning point within the branch predictor values for a fetch group, not a determination of the entry to select. In other embodiments, offset bits may be used in the index generation but may also be used to select branch prediction values from the entry (and possibly the next consecutive entry).

Turning next to FIG. 4, a flowchart is shown illustrating operation of one embodiment of the branch direction predictor 20 to generate a branch direction prediction. While the blocks are shown in a particular order for ease of understanding, other orders may be used. Blocks may be performed in parallel in combinatorial logic circuitry in the branch direction predictor 20. Blocks, combinations of blocks, and/or the flowchart as a whole may be pipelined over multiple clock cycles. The branch direction predictor 20 and components thereof as shown in FIG. 2 may be configured to implement the operation shown in FIG. 4.

The index generator 32 may be configured to generate the indexes for the predictor tables (block 60). As mentioned previously, each index may be generated from a different combination of address, global history, and path history bits (although the sets of bits used for each index may overlap). If the offset of the fetch address is in the upper bank 50 of the indexed entries, the index generator 32 may be configured to generate the indexes+1 for the lower banks of the predictor tables (decision block 62, “yes” leg and block 64). Otherwise, the same index may be used in each predictor table for the upper bank and the lower bank.
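The bank selection in blocks 62 and 64 might be sketched as follows. The 64-byte cache line, the split of each entry into two equal halves, the table size, and the wraparound of index+1 at the end of the table are hypothetical choices for this sketch only.

```python
def bank_indexes(index, offset, line_bytes=64, num_entries=1024):
    """Return (upper_bank_index, lower_bank_index) for one predictor table.

    Assumption: the lower bank covers the first half of a cache line and
    the upper bank covers the second half.
    """
    starts_in_upper = offset >= line_bytes // 2
    upper_index = index
    if starts_in_upper:
        # The fetch group may run into the next entry, so the lower bank
        # is read at index+1 (wraparound here is an assumption).
        lower_index = (index + 1) % num_entries
    else:
        lower_index = index
    return upper_index, lower_index
```

For example, under these assumptions a fetch address with offset 40 in entry 5 would read the upper bank at index 5 and the lower bank at index 6.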

The predictor tables may be configured to output predictor values from the indexed entry (or entries, if the fetch group crosses a cache line boundary) (block 66). The adders 36 may be configured to add the branch predictor values from each predictor table that correspond to a given position within the fetch group, outputting corresponding branch predictions for each position based on the sign of the corresponding sum (block 68). Additionally, the adders 36 may generate the update indicators for each weak prediction (sum of branch prediction values near zero, e.g. within a threshold of zero) (block 70). The branch direction predictor 20 may be configured to transmit the branch predictions and update indicators with the corresponding instructions (block 72).
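A minimal sketch of blocks 68 and 70 follows. It assumes that a non-negative sum is interpreted as predicted taken and that the threshold comparison is inclusive; both are assumptions, as is the function name.

```python
def predict_fetch_group(table_values, threshold):
    """Sum per-position predictor values across tables.

    table_values: one list per predictor table, each holding one signed
                  value per position in the fetch group.
    Returns a list of (predicted_taken, update_indicator) per position.
    """
    results = []
    for position_values in zip(*table_values):
        total = sum(position_values)
        predicted_taken = total >= 0          # assumption: sign >= 0 is taken
        weak = abs(total) <= threshold        # sum within threshold of zero
        results.append((predicted_taken, weak))
    return results


if __name__ == "__main__":
    tables = [[3, -1], [2, -4], [-4, 1]]
    print(predict_fetch_group(tables, threshold=2))
```

Because the update indicator is computed from the same sum that produces the prediction, no second table lookup is needed later to decide whether a correctly predicted branch should still train the tables.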

FIG. 5 is a flowchart illustrating operation of one embodiment of the branch execution unit 18 to execute a conditional branch operation. While the blocks are shown in a particular order for ease of understanding, other orders may be used. Blocks may be performed in parallel in combinatorial logic circuitry in the branch execution unit 18. Blocks, combinations of blocks, and/or the flowchart as a whole may be pipelined over multiple clock cycles. The branch execution unit 18 may be configured to implement the operation shown in FIG. 5.

The branch execution unit 18 may be configured to evaluate the condition or conditions specified by the branch operation to determine if the branch is taken or not taken (block 80). The conditions may be condition codes selected from a condition code register, condition codes forwarded from another operation, or the result of a comparison specified by the branch operation, for example. If the branch direction is mispredicted (e.g. predicted taken but determined to be not taken or vice versa) (decision block 82, “yes” leg), the branch execution unit 18 may be configured to generate an update for the branch direction predictor 20 (block 84). The update may include the address of the branch, an update valid signal indicating that an update is being transmitted, and the taken/not taken result of the branch. Additionally, if the branch is correctly predicted but the update indicator indicates update (decision block 82, “no” leg and decision block 86, “yes” leg), the branch execution unit 18 may be configured to generate the update for the branch direction predictor 20 (block 84). It is noted that the misprediction update and the update due to the update indicator may be independent of each other. That is, it is not necessary to determine that the prediction is correct to generate the update in response to the update indicator. Rather, the update due to the update indicator may occur even if the branch is correctly predicted.
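The decision in blocks 82 through 86 reduces to a logical OR of the two conditions, as the following sketch shows. The dictionary layout and the function name are hypothetical and are used here only to mirror the three update fields named above.

```python
def make_update(branch_addr, predicted_taken, actual_taken, update_indicator):
    """Return an update request, or None if no update is generated."""
    mispredicted = predicted_taken != actual_taken
    # The two causes are independent: either one alone triggers an update.
    if not (mispredicted or update_indicator):
        return None
    return {
        "valid": True,            # update valid signal
        "address": branch_addr,   # address of the branch
        "taken": actual_taken,    # taken/not taken result
    }
```

A correctly predicted branch whose prediction was weak still produces a request, while a correctly predicted branch with a strong prediction produces none.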

FIG. 6 is a flowchart illustrating operation of one embodiment of the branch direction predictor 20 to update a branch prediction in response to an update request from the branch execution unit 18. As mentioned previously, the update request may be generated by the branch execution unit 18 responsive to a misprediction of a branch or responsive to an update indicator corresponding to the branch indicating the update. While the blocks are shown in a particular order for ease of understanding, other orders may be used. Blocks may be performed in parallel in combinatorial logic circuitry in the branch direction predictor 20. Blocks, combinations of blocks, and/or the flowchart as a whole may be pipelined over multiple clock cycles. The branch direction predictor 20 and components thereof as shown in FIG. 2 may be configured to implement the operation shown in FIG. 6.

The branch direction predictor 20 may read the branch predictor values from the predictor tables 30A-30N (block 90). Reading the branch predictor values may include generating the indexes (and possibly different lower bank indexes if the fetch group crossed a cache line boundary) for the predictor tables 30A-30N, similar to the original read of the branch predictor values for prediction. The index generator 32 may be configured to checkpoint global branch history and path history for each branch to permit recreation of the indexes, or these values may be carried with the branch instructions and returned as part of the update requests. Alternatively, the indexes as generated during the read may be checkpointed in the index generator 32 or carried with the branch instruction and returned as part of the update requests in addition to the update address or instead of the update address.

The branch predictor values may be provided to the increment/decrement unit 34, which may be configured to increment each branch predictor value associated with the branch if the branch result is taken, or decrement each branch predictor value associated with the branch if the branch result is not taken, for embodiments which interpret a positive sign of the sum as predicted taken. For embodiments in which a positive sign of the sum is interpreted as predicted not taken, the increment/decrement unit 34 may be configured to increment each branch predictor value associated with the branch if the branch result is not taken, or decrement each branch predictor value associated with the branch if the branch result is taken (block 92). The increment/decrement unit 34 may provide the updated branch predictor values on the write ports of the predictor tables 30A-30N, which may write the updated branch predictor values to the indexed entries (block 94).
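The increment/decrement step of block 92 might look like the sketch below. The 4-bit signed counter width and the saturation at the ends of the range are assumptions of this sketch; the text above does not state how overflow is handled.

```python
def train(values, taken, bits=4, positive_means_taken=True):
    """Return the updated predictor values for one branch.

    values: the signed values read from each predictor table.
    Saturating at the counter limits is an assumption of this sketch.
    """
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    # Move toward the sign that encodes the actual branch result.
    step = 1 if taken == positive_means_taken else -1
    return [max(lo, min(hi, v + step)) for v in values]


if __name__ == "__main__":
    print(train([7, -8, 0], taken=True))   # counters already at +7 stay there
```

The positive_means_taken flag covers both sign conventions mentioned above: when it is False, a taken result decrements the values instead of incrementing them.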

Turning now to FIG. 7, a block diagram of one embodiment of a system 150 is shown. In the illustrated embodiment, the system 150 includes at least one instance of an integrated circuit 152 (which may include at least one instance of the processor 10 shown in FIG. 1) coupled to one or more peripherals 154 and an external memory 158. A power supply 156 is provided which supplies the supply voltages to the integrated circuit 152 as well as one or more supply voltages to the memory 158 and/or the peripherals 154. In some embodiments, more than one instance of the integrated circuit 152 may be included (and more than one external memory 158 may be included as well). The integrated circuit 152 may be a system on a chip (SOC) including one or more instances of the processor 10 and other circuitry such as a memory controller to interface to the external memory 158 and/or various on-chip peripherals.

The peripherals 154 may include any desired circuitry, depending on the type of system 150. For example, in one embodiment, the system 150 may be a mobile device (e.g. personal digital assistant (PDA), smart phone, etc.) and the peripherals 154 may include devices for various types of wireless communication, such as wifi, Bluetooth, cellular, global positioning system, etc. The peripherals 154 may also include additional storage, including RAM storage, solid state storage, or disk storage. The peripherals 154 may include user interface devices such as a display screen, including touch display screens or multitouch display screens, keyboard or other input devices, microphones, speakers, etc. In other embodiments, the system 150 may be any type of computing system (e.g. desktop personal computer, laptop, workstation, net top, etc.).

The external memory 158 may include any type of memory. For example, the external memory 158 may be SRAM, dynamic RAM (DRAM) such as synchronous DRAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, RAMBUS DRAM, etc. The external memory 158 may include one or more memory modules to which the memory devices are mounted, such as single inline memory modules (SIMMs), dual inline memory modules (DIMMs), etc. Alternatively, the external memory 158 may include one or more memory devices that are mounted on the integrated circuit 152 in a chip-on-chip or package-on-package implementation.

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

What is claimed is:
 1. A processor comprising: a branch direction predictor configured to generate a branch direction prediction corresponding to a branch instruction, wherein the branch direction predictor is configured to detect whether or not a branch prediction value associated with the branch direction prediction is within a changing threshold and is configured to transmit an indication with each branch direction prediction indicating whether or not the branch prediction value is within the threshold; and a branch execution unit coupled to the branch direction predictor, wherein the branch execution unit is configured to execute a branch instruction, and wherein the branch execution unit is configured to generate an update request for the branch direction predictor responsive to the indication indicating update, even in a case that the branch execution unit would not generate an update due to a result of executing the branch instruction.
 2. The processor as recited in claim 1 wherein the branch execution unit is configured to detect a misprediction of the branch direction of the branch instruction, and wherein the branch execution unit is configured to generate the update request in response to the misprediction.
 3. The processor as recited in claim 2 wherein the branch execution unit is configured to generate a branch direction of the branch instruction, and wherein the branch execution unit is configured to generate the update request responsive to the branch direction and the indication indicating update.
 4. The processor as recited in claim 1 wherein the branch execution unit is configured to generate the update in response to the indication even in the case that the branch direction prediction is correct.
 5. The processor as recited in claim 1 wherein the branch predictor is configured to generate a branch direction prediction for potential branch instructions in a fetch group according to a first instruction set implemented by the processor and a size of the fetch group.
 6. The processor as recited in claim 5 wherein the size of the fetch group is fixed and crosses a cache line boundary for at least some values of the fetch address.
 7. The processor as recited in claim 5 wherein the processor also implements a second instruction set, and wherein a number of the branch direction predictions is less than a number of the potential branch instructions in the fetch group for the second instruction set.
 8. A method comprising: reading a plurality of branch prediction tables in response to a fetch address, wherein the tables are configured to output at least one value from an entry of the table, and wherein an index to each table of the plurality of branch prediction tables is generated differently; summing the values output from the tables, wherein a sign of the sum indicates a direction prediction for a branch instruction; predicting the direction of the branch instruction responsive to the sign; detecting that the sum is within a threshold value of zero responsive to the summing, wherein the tables are trained responsive to the sum being within the threshold value of zero; and transmitting an indication that the tables are to be trained responsive to the detecting that the sum is within a threshold value of zero.
 9. The method as recited in claim 8 further comprising: executing the branch instruction; and updating the plurality of branch prediction tables responsive to the indication even in a case that the direction prediction is correct.
 10. The method as recited in claim 8 further comprising: executing a second branch instruction, wherein the sum corresponding to the second branch instruction is not within the threshold; detecting that a second direction prediction corresponding to the second branch instruction is correct; and not updating the plurality of branch prediction tables responsive to the detecting and the sum being not within the threshold.
 11. The method as recited in claim 8 further comprising: executing a second branch instruction; detecting that a second direction prediction corresponding to the second branch instruction is incorrect; and updating the plurality of branch prediction tables responsive to the detecting and independent of whether or not the sum corresponding to the second branch instruction is within the threshold.
 12. A processor comprising a branch direction predictor, the branch prediction predictor including: an index generator configured to generate a plurality of indexes responsive to a fetch address and branch history data; a plurality of branch prediction tables coupled to the index generator, wherein the tables are addressed using respective indexes generated by the index generator, and the tables configured to output at least one value in response to the respective index; and at least one adder configured to: sum the at least one value from the tables to generate a branch direction prediction; generate an indication that the sum is within a threshold of changing the branch direction prediction; and transmit an indication with the branch direction prediction indicating whether or not the sum is within the threshold.
 13. The processor as recited in claim 12 further comprising a branch execution unit coupled to the branch direction predictor, wherein the branch execution unit is configured to execute the branch instruction corresponding to the branch direction prediction, and wherein the branch execution unit is configured to generate an update request for the branch direction predictor responsive to the indication indicating that the sum is within a threshold of changing direction, even in a case that the branch execution unit would not generate an update due to a result of executing the branch instruction.
 14. The processor as recited in claim 13 wherein the branch execution unit is configured to detect a misprediction of the branch direction of the branch instruction, and wherein the branch execution unit is configured to generate the update request in response to the misprediction.
 15. The processor as recited in claim 13 wherein the branch execution unit is configured to generate a branch direction of the branch instruction, and wherein the branch execution unit is configured to generate the update request responsive to the branch direction and the indication indicating update.
 16. The processor as recited in claim 13 wherein the branch execution unit is configured to generate the update in response to the indication even in the case that the branch direction prediction is correct.
 17. The processor as recited in claim 13 wherein the branch predictor is configured to generate a branch direction prediction for each potential branch instruction in a fetch group according to a first instruction set implemented by the processor and a size of the fetch group.
 18. The processor as recited in claim 17 wherein the size of the fetch group is fixed and crosses a cache line boundary for at least some values of the fetch address.
 19. The processor as recited in claim 17 wherein the processor also implements a second instruction set, and wherein a number of the branch direction predictions is less than a number of the potential branch instructions in the fetch group for the second instruction set.
 20. The processor as recited in claim 12 wherein the update includes an update for the branch prediction value corresponding to the branch instruction in each of the plurality of branch prediction tables.